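With the prevalence of LiDAR sensors in autonomous driving, 3D object tracking has received increasing attention. In a point cloud sequence, 3D object tracking aims to predict the location and orientation of an object in consecutive frames given an object template. Motivated by the success of transformers, we propose Point Tracking TRansformer (PTTR), which efficiently predicts high-quality 3D tracking results in a coarse-to-fine manner with the help of transformer operations. PTTR consists of three novel designs. 1) Instead of random sampling, we design Relation-Aware Sampling to preserve points relevant to the given template during subsampling. 2) We propose a Point Relation Transformer for effective feature aggregation and feature matching between the template and the search region. 3) Based on the coarse tracking results, we employ a novel Prediction Refinement Module to obtain the final refined prediction through local feature pooling. In addition, motivated by the favorable properties of the bird's-eye view (BEV) of point clouds in capturing object motion, we further design a more advanced framework named PTTR++, which incorporates both the point-wise view and the BEV representation to exploit their joint effect in producing high-quality tracking results. PTTR++ substantially boosts the tracking performance on top of PTTR with low computational overhead. Extensive experiments on multiple datasets show that our proposed approaches achieve superior 3D tracking accuracy and efficiency.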
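As a rough illustration of what a bird's-eye-view representation of a point cloud looks like, the sketch below drops the height coordinate and counts points per ground-plane cell. The function name `points_to_bev`, the grid window, and the cell size are assumptions made for this example; the abstract does not describe how PTTR++ actually builds its BEV features.

```python
from collections import Counter


def points_to_bev(points, x_range, y_range, cell_size):
    """Count points per BEV cell, ignoring height (z).

    Hypothetical helper for illustration only; not the PTTR++ implementation.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    counts = Counter()
    for x, y, _z in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # drop points outside the BEV window
        col = int((x - x_min) // cell_size)
        row = int((y - y_min) // cell_size)
        counts[(row, col)] += 1
    return counts


# Toy point cloud: three points inside a 2 m x 2 m window, one outside.
points = [(0.2, 0.1, 1.5), (0.4, 0.3, 0.2), (1.6, 0.2, 0.9), (5.0, 5.0, 0.0)]
bev = points_to_bev(points, x_range=(0.0, 2.0), y_range=(0.0, 2.0), cell_size=1.0)
```

Because the vertical axis is collapsed, motion in the ground plane shows up as a shift between cells, which is one plausible reading of why a BEV grid is convenient for capturing object motion.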
translated by 谷歌翻译
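In a point cloud sequence, 3D object tracking aims to predict the location and orientation of an object in the current search point cloud given a template point cloud. Motivated by the success of transformers, we propose Point Tracking TRansformer (PTTR), which efficiently predicts high-quality 3D tracking results in a coarse-to-fine manner with the help of transformer operations. PTTR consists of three novel designs. 1) Instead of random sampling, we design Relation-Aware Sampling to preserve points relevant to the given template during subsampling. 2) Furthermore, we propose a Point Relation Transformer (PRT) consisting of a self-attention module and a cross-attention module. The global self-attention operation captures long-range dependencies to enhance the encoded point features of the search area and the template, respectively. Subsequently, we generate the coarse tracking results by matching the two sets of point features via cross-attention. 3) Based on the coarse tracking results, we employ a novel Prediction Refinement Module to obtain the final refined prediction. In addition, we create a large-scale point cloud single object tracking benchmark based on the Waymo Open Dataset. Extensive experiments show that PTTR achieves superior point cloud tracking in both accuracy and efficiency.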
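The idea of sampling search points by their relation to the template, rather than at random, can be sketched as follows. This is a minimal sketch under stated assumptions: it scores each search point by its smallest Euclidean feature distance to any template point and keeps the `k` best. The scoring rule, the function name `relation_aware_sample`, and the toy features are illustrative choices, not details given in the abstract.

```python
import math


def relation_aware_sample(search_feats, template_feats, k):
    """Return indices of the k search points nearest (in feature space) to the template.

    Hypothetical scoring rule for illustration; the paper's exact criterion may differ.
    """
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Score each search point by its distance to the closest template point.
    scores = [min(dist(s, t) for t in template_feats) for s in search_feats]
    ranked = sorted(range(len(search_feats)), key=lambda i: scores[i])
    return sorted(ranked[:k])


template = [(1.0, 0.0), (0.9, 0.1)]
search = [(0.0, 1.0), (1.0, 0.05), (0.5, 0.5), (0.95, 0.1)]
kept = relation_aware_sample(search, template, k=2)  # keeps indices 1 and 3
```

Random sampling could just as easily have kept the two search points that look nothing like the template; ranking by similarity avoids discarding the points most likely to belong to the tracked object.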
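Image- and video-based 3D human recovery (i.e., pose and shape estimation) has achieved substantial progress. However, due to the prohibitive cost of motion capture, existing datasets are often limited in scale and diversity. In this work, we obtain massive human sequences by playing a video game with automatically annotated 3D ground truth. Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine, featuring a highly diverse set of subjects, actions, and scenarios. More importantly, we study the use of game-playing data and obtain five major insights. First, game-playing data is highly effective. A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin. For video-based methods, GTA-Human is even on par with the in-domain training set. Second, we find that synthetic data provides a critical complement to real data, which is typically collected indoors. Our investigation into the domain gap explains why simple yet useful data mixture strategies work. Third, the scale of the dataset matters. The performance boost is closely related to the amount of additional data available. A systematic study reveals the model's sensitivity to data density from multiple key aspects. Fourth, the effectiveness of GTA-Human is also attributed to its rich collection of strong supervision labels (SMPL parameters), which are otherwise expensive to acquire in real datasets. Fifth, the benefits of synthetic data extend to larger models such as deeper convolutional neural networks (CNNs) and Transformers, for which a significant impact is also observed. We hope our work can pave the way for scaling 3D human recovery to the real world. Homepage: https://caizhongang.github.io/projects/gta-human/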
Accurate and timely rain prediction is crucial for decision making and is also a challenging task. This paper presents a solution that won the 2nd prize in the Weather4cast 2022 NeurIPS competition, using 3D U-Nets and EarthFormers for 8-hour probabilistic rain prediction based on multi-band satellite images. The effect of the spatial context of the input satellite image is explored in depth, and an optimal context range is identified. Because the rain distribution is imbalanced, we trained multiple models with different loss functions. To further improve model performance, multi-model ensembling and threshold optimization were used to produce the final probabilistic rain prediction. Experimental results and leaderboard scores demonstrate that optimal spatial context, a combined loss function, multi-model ensembling, and threshold optimization each provide a modest gain. A permutation test was used to analyze the effect of each satellite band on rain prediction, and the results show that the satellite bands signifying cloud-top phase (8.7 µm) and cloud-top height (10.8 and 13.4 µm) are the best predictors of rain. The source code is available at https://github.com/bugsuse/weather4cast-2022-stage2.
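The combination of multi-model ensembling and threshold optimization can be sketched roughly as below. This is an assumption-laden illustration: probabilities are averaged with equal weights, and the threshold is chosen to maximize the critical success index (CSI) on held-out data. The abstract does not state the ensembling weights or the metric used, and all function names and numbers here are hypothetical.

```python
def ensemble_mean(prob_maps):
    """Average per-pixel rain probabilities across models (equal weights assumed)."""
    n = len(prob_maps)
    return [sum(ps) / n for ps in zip(*prob_maps)]


def csi(pred, truth):
    """Critical success index: hits / (hits + false alarms + misses)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def best_threshold(probs, truth, candidates):
    """Return (score, threshold) for the candidate with the highest CSI."""
    scored = [(csi([p >= th for p in probs], truth), th) for th in candidates]
    return max(scored)  # ties are broken toward the larger threshold


# Toy validation data: two models, five pixels, binary rain labels.
model_a = [0.9, 0.7, 0.3, 0.2, 0.1]
model_b = [0.7, 0.4, 0.5, 0.2, 0.1]
truth = [1, 1, 0, 0, 0]

probs = ensemble_mean([model_a, model_b])
score, threshold = best_threshold(probs, truth, [0.3, 0.5, 0.7])
```

With an imbalanced rain distribution, the default cut-off of 0.5 is not necessarily the best operating point, which is why the threshold is tuned on validation data instead of being fixed in advance.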
Resistive Random-Access Memory (RRAM) is well-suited to accelerate neural network (NN) workloads as RRAM-based Processing-in-Memory (PIM) architectures natively support highly-parallel multiply-accumulate (MAC) operations that form the backbone of most NN workloads. Unfortunately, NN workloads such as transformers require support for non-MAC operations (e.g., softmax) that RRAM cannot provide natively. Consequently, state-of-the-art works either integrate additional digital logic circuits to support the non-MAC operations or offload the non-MAC operations to CPU/GPU, resulting in significant performance and energy efficiency overheads due to data movement. In this work, we propose NEON, a novel compiler optimization to enable the end-to-end execution of the NN workload in RRAM. The key idea of NEON is to transform each non-MAC operation into a lightweight yet highly-accurate neural network. Utilizing neural networks to approximate the non-MAC operations provides two advantages: 1) We can exploit the key strength of RRAM, i.e., highly-parallel MAC operation, to flexibly and efficiently execute non-MAC operations in memory. 2) We can simplify RRAM's microarchitecture by eliminating the additional digital logic circuits while reducing the data movement overheads. Acceleration of the non-MAC operations in memory enables NEON to achieve a 2.28x speedup compared to an idealized digital logic-based RRAM. We analyze the trade-offs associated with the transformation and demonstrate feasible use cases for NEON across different substrates.
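We present the encoder-forecaster convolutional long short-term memory (LSTM) deep learning model that powers Microsoft Weather's operational precipitation nowcasting product. The model takes a sequence of weather radar mosaics as input and deterministically predicts future radar reflectivity at lead times of up to 6 hours. By stacking a large input receptive field along the feature dimension and conditioning the model's forecaster on predictions from the physics-based High-Resolution Rapid Refresh (HRRR) model, we are able to outperform optical flow and HRRR baselines by 20-25% on multiple metrics, averaged over all lead times.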
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.
As one of the most important psychological stress reactions, micro-expressions (MEs) are spontaneous and transient facial expressions that can reveal the genuine emotions of human beings. Thus, automatic ME recognition (MER) is becoming increasingly crucial in the field of affective computing, and provides essential technical support for lie detection, psychological analysis, and other areas. However, the lack of abundant ME data seriously restricts the development of cutting-edge data-driven MER models. Although several recent spontaneous ME datasets have tried to alleviate this problem, the amount of available data is still small. To address this data shortage, we construct a dynamic spontaneous ME dataset with the largest current ME data scale, called DFME (Dynamic Facial Micro-expressions), which includes 7,526 well-labeled ME videos elicited from 671 participants and annotated by more than 20 annotators over three years. We then apply four classical spatiotemporal feature learning models to DFME in MER experiments to objectively verify the validity of the dataset. In addition, we explore different solutions to the class imbalance and key-frame sequence sampling problems in dynamic MER on DFME, so as to provide a valuable reference for future research. The comprehensive experimental results show that DFME can facilitate research on automatic MER and provide a new benchmark for it. DFME will be published via https://mea-lab-421.github.io.
Face Anti-spoofing (FAS) is essential to secure face recognition systems against various physical attacks. However, recent research generally focuses on short-distance applications (e.g., phone unlocking) and gives little consideration to long-distance scenes (e.g., surveillance security checks). To promote relevant research and fill this gap in the community, we collect a large-scale Surveillance High-Fidelity Mask (SuHiFiMask) dataset captured under 40 surveillance scenes, which has 101 subjects from different age groups with 232 3D attacks (high-fidelity masks), 200 2D attacks (posters, portraits, and screens), and 2 adversarial attacks. In such scenes, low image resolution and noise interference are new challenges for surveillance FAS. Together with the SuHiFiMask dataset, we propose a Contrastive Quality-Invariance Learning (CQIL) network to alleviate the performance degradation caused by image quality from three aspects: (1) An Image Quality Variable module (IQV) is introduced to recover image information associated with discrimination by incorporating a super-resolution network. (2) Generated sample pairs are used to simulate quality variance distributions, helping contrastive learning strategies obtain robust feature representations under quality variation. (3) A Separate Quality Network (SQN) is designed to learn discriminative features independent of image quality. Finally, extensive experiments verify the quality of the SuHiFiMask dataset and the superiority of the proposed CQIL.
The interview is regarded as one of the most crucial steps in recruitment. To prepare fully for interviews with recruiters, job seekers usually practice by holding mock interviews with each other. However, such a mock interview with peers is generally far from the real interview experience: the mock interviewers are not guaranteed to be professional and are unlikely to behave like a real interviewer. Due to the rapid growth of online recruitment in recent years, recruiters tend to hold online interviews, which makes it possible to collect real interview data from real interviewers. In this paper, we propose a novel application named EZInterviewer, which aims to learn from online interview data and provide mock interview services to job seekers. The task is challenging in two ways: (1) interview data are now available but still scarce; (2) generating meaningful and relevant interview dialogs requires a thorough understanding of both resumes and job descriptions. To address the low-resource challenge, EZInterviewer is trained on a very small set of interview dialogs. The key idea is to reduce the number of parameters that rely on interview dialogs by disentangling the knowledge selector and the dialog generator, so that most parameters can be trained with ungrounded dialogs and with resume data, neither of which is low-resource. Evaluation results on a real-world job interview dialog dataset indicate that our approach achieves promising results in generating mock interviews. With the help of EZInterviewer, we hope to make mock interview practice easier for job seekers.